
Clean up benchmarks #465

Merged — 1 commit merged into rust-random:master on May 21, 2018
Conversation

pitdicker (Contributor)

This includes the change as discussed for the gen_bool and Bernoulli benchmarks.

Also, it turns out almost all our uses of black_box were unnecessary; it is enough to return a variable or some accumulator at the end of the benchmark run.
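The accumulator pattern described above can be sketched as follows. This is illustrative only: a toy xorshift stands in for rand's RNGs, and `RAND_BENCH_N`/`bench_body` are made-up names here — the real benchmarks run under `test::Bencher`.

```rust
// Minimal sketch of "return an accumulator instead of black_box".
// Assumption: a toy xorshift32 stands in for the real RNG.
struct XorShift32(u32);

impl XorShift32 {
    fn next_u32(&mut self) -> u32 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.0 = x;
        x
    }
}

const RAND_BENCH_N: u32 = 1000;

// Because the accumulator feeds the return value, the optimizer cannot
// discard the loop body — no black_box call is needed.
fn bench_body(rng: &mut XorShift32) -> u32 {
    let mut accum = 0u32;
    for _ in 0..RAND_BENCH_N {
        accum = accum.wrapping_add(rng.next_u32());
    }
    accum
}

fn main() {
    let mut rng = XorShift32(0x1234_5678);
    println!("{}", bench_body(&mut rng));
}
```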

Results with `cargo benchcmp --threshold 2`:

 name                             master ns/iter       new ns/iter          diff ns/iter   diff %  speedup 
 distr_standard_alphanumeric      2,430 (1646 MB/s)    1,803 (2218 MB/s)            -627  -25.80%   x 1.35 
 distr_standard_bool              4,379 (228 MB/s)     1,055 (947 MB/s)           -3,324  -75.91%   x 4.15 
 distr_uniform_i64                5,497 (1455 MB/s)    5,384 (1485 MB/s)            -113   -2.06%   x 1.02 
 gen_u32_hc128                    2,498 (1601 MB/s)    2,265 (1766 MB/s)            -233   -9.33%   x 1.10 
 gen_u32_isaac                    3,448 (1160 MB/s)    3,593 (1113 MB/s)             145    4.21%   x 0.96 
 gen_u32_std                      2,223 (1799 MB/s)    2,508 (1594 MB/s)             285   12.82%   x 0.89 
 gen_u64_jitter                   23,335               23,998                        663    2.84%   x 0.97 
 init_chacha                      31                   27                             -4  -12.90%   x 1.15 
 init_hc128                       5,072                4,852                        -220   -4.34%   x 1.05 
 init_isaac                       1,454                1,417                         -37   -2.54%   x 1.03 
 init_jitter                      23,329               24,014                        685    2.94%   x 0.97 
 misc_bernoulli_const             2,234                3,873                       1,639   73.37%   x 0.58 
 misc_bernoulli_var               2,922                6,216                       3,294  112.73%   x 0.47 
 misc_gen_bool_const              2,201                4,221                       2,020   91.78%   x 0.52 
 misc_gen_bool_var                4,469                6,211                       1,742   38.98%   x 0.72 
 misc_sample_slice_ref_10_of_100  158                  167                             9    5.70%   x 0.95 
 thread_rng_u32                   3,197 (1251 MB/s)    3,461 (1155 MB/s)             264    8.26%   x 0.92 
  • The misc_gen_bool and misc_bernoulli benchmarks are clearly different (in part because of using StdRng).
  • Benchmarking distributions producing a bool or char using an accumulator gives more realistic results.
  • Hc128Rng and StdRng produce results somewhere between 1250 MB/s and 1800 MB/s, without recompilation. Still don't know what causes the variance.
 • I didn't dare to remove black_box from the gen_bytes benchmarks; it improves performance a bit more than seems plausible, and I didn't want to dig through the assembly for those...
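For a byte-buffer benchmark like gen_bytes there is no cheap scalar accumulator to return, which is presumably why black_box stays. A minimal sketch of that shape, with a counter-based fill standing in for the real `fill_bytes` (the buffer size and fill logic here are assumptions for illustration):

```rust
use std::hint::black_box;

// Stand-in for an RNG's fill_bytes: writes a running counter into the buffer.
fn fill_bytes(buf: &mut [u8], state: &mut u8) {
    for b in buf.iter_mut() {
        *state = state.wrapping_add(1);
        *b = *state;
    }
}

fn main() {
    let mut buf = [0u8; 1024];
    let mut state = 0u8;
    fill_bytes(&mut buf, &mut state);
    // The whole buffer is passed through black_box to keep the writes
    // live; returning 1024 bytes by value is not a practical alternative.
    black_box(&buf);
    println!("{}", buf[0]);
}
```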

```rust
let mut accum = 0u32;
for _ in 0..::RAND_BENCH_N {
    let x: char = distr.sample(&mut rng);
    accum = accum.wrapping_add(x as u32);
}
```
Member:

if this is the only difference, I think you could just do the same for booleans, though doesn't really matter

```diff
 for _ in 0..::RAND_BENCH_N {
-    black_box(rng.gen_bool(p));
+    accum ^= rng.gen_bool(p);
+    p += 0.0001;
```
Member:

Adjusting p seems sensible, but it adds an extra op, making this incomparable with misc_gen_bool_const. Was the old method not okay?

pitdicker (Contributor, Author):

Although LLVM didn't move the range set-up code out of the inner loop, I see no reason why it couldn't. Changing p makes sure it cannot.

Funny thing: using black_box before the for loop is slower than incrementing p each round.
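The varying-p idea discussed here can be sketched in isolation. Everything below is illustrative: a plain threshold test stands in for `rng.gen_bool`, and the seed, starting p, and round count are arbitrary choices, not values from the benchmarks.

```rust
// Toy xorshift32 step, standing in for a real RNG.
fn xorshift(state: &mut u32) -> u32 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    x
}

// Stand-in for rng.gen_bool(p): compare a uniform draw against p.
fn gen_bool(state: &mut u32, p: f64) -> bool {
    (xorshift(state) as f64) < p * (u32::MAX as f64)
}

// Incrementing p each round keeps the per-p set-up work inside the loop;
// with a constant p, the compiler could hoist that set-up out entirely.
fn run(seed: u32, rounds: u32) -> bool {
    let mut state = seed;
    let mut p = 0.18;
    let mut accum = false;
    for _ in 0..rounds {
        accum ^= gen_bool(&mut state, p);
        p += 0.0001;
    }
    accum
}

fn main() {
    println!("{}", run(0xDEAD_BEEF, 1000));
}
```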

@pitdicker (Contributor, Author)

Updated to fold the char and bool distribution macros back into one.

```diff
 for _ in 0..::RAND_BENCH_N {
-    black_box(rng.sample(d));
+    let d = rand::distributions::Bernoulli::new(p);
```
Member:

In this case I think the point was to check what effect using an unknown p has over using a constant, not to adjust the distribution each loop and do a single call for each instance.

pitdicker (Contributor, Author):

That is also possible to benchmark, but then we should expect it to give the same result as the const version. It does not seem like something all that useful to measure to me.
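The "unknown p" variant under discussion — a fixed distribution whose parameter the optimizer cannot constant-fold — could be sketched like this. A toy threshold-based Bernoulli stands in for rand's real `Bernoulli`; the type, seed, and trial count here are assumptions for illustration.

```rust
use std::hint::black_box;

// Toy Bernoulli: sample by comparing a uniform u32 draw to a threshold.
struct Bernoulli {
    threshold: u32,
}

impl Bernoulli {
    fn new(p: f64) -> Self {
        Bernoulli { threshold: (p * (u32::MAX as f64)) as u32 }
    }

    fn sample(&self, state: &mut u32) -> bool {
        let mut x = *state;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        *state = x;
        x < self.threshold
    }
}

fn main() {
    // black_box hides the literal 0.18 from the optimizer, modelling an
    // "unknown p", while the set-up still runs once, outside the loop.
    let d = Bernoulli::new(black_box(0.18));
    let mut state = 0x9E37_79B9u32;
    let mut count = 0u32;
    for _ in 0..10_000 {
        if d.sample(&mut state) {
            count += 1;
        }
    }
    println!("{}", count);
}
```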

@dhardy dhardy merged commit 8f57a13 into rust-random:master May 21, 2018
@pitdicker pitdicker deleted the benchmarks branch May 21, 2018 12:43